I’ve chosen to investigate the Red Wine Quality dataset. First let’s take a look at the summary statistics:
##        X          fixed.acidity   volatile.acidity   citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200     Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900     1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200     Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278     Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400     3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800     Max.   :1.000
##  residual.sugar   chlorides         free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide   density          pH              sulphates
##  Min.   :  6.00         Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00         1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00         Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47         Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00         3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00         Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##  alcohol         quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
Next, it’s nice to visualize the features to see the shape of their distributions:
Fixed and volatile acidity, chlorides, density, pH, and quality have roughly normal distributions. Citric acid, residual sugar, the sulfur dioxides, sulphates, and alcohol are right-skewed with long tails, closer to Poisson in shape. Nothing too out of the ordinary.
There are 1599 wines tested for 11 continuous variables and one discrete variable (quality). The data contains no categorical variables. There are no missing values in the data. Mean quality score is 5.636 with a standard deviation of 0.80. Scores of 8 and 3 are therefore about 3 standard deviations from the mean. Plotting a histogram of scores, i.e. quality, I see that there are no scores below 3 or above 8, and the distribution of scores is roughly normal.
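A quick sketch of the arithmetic behind the "about 3 standard deviations" remark. It uses the rounded mean and standard deviation quoted above rather than the raw data, so the results are approximate; the variable names are just for illustration.

```r
# Rounded summary values quoted in the text (not recomputed from the data)
mean_quality <- 5.636
sd_quality <- 0.80

# Distance of the lowest and highest observed scores from the mean, in SDs
extreme_scores <- c(lowest = 3, highest = 8)
z_scores <- (extreme_scores - mean_quality) / sd_quality
print(round(z_scores, 1))
```

The lowest score sits a little more than 3 standard deviations below the mean, and the highest a little less than 3 above it.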
Quality is clearly the most important variable, as it will be the outcome of the linear model I’ll create later on. I don’t really see how we could choose the important features based on a univariate analysis, at least in this particular dataset. Based on a correlation analysis (shown in the bivariate section), alcohol content and volatile acidity are the most influential. Alcohol ranges from 8.4 to 14.9% with a peak around 9.5% and a gradual, tapering decline from there. Volatile acidity ranges from 0.12 to 1.58 g/L, with a mean of 0.53. Its distribution is more normal than alcohol’s. It may be helpful to log transform alcohol to get a little more out of the long right tail:
Log transforming doesn’t seem useful. I guess the tail isn’t really long enough for a transform to be meaningful. Abort log transform.
Based on correlation, citric acid and sulphates are somewhat significant features. Citric acid is unique in that it has a mode of 0, yet a fairly broad range, suggesting (perhaps) that it doesn’t affect flavor that much, or at least not in a way that affects quality perception. I doubt it will be much of a player. Density and residual sugar may be somewhat supportive features as well, based on correlation, and the rest will probably be unimportant.
I created one new variable, called “aqueous density”, which is the density of what remains of the wine after fractioning out the alcohol content. I wanted to explore why density is less correlated with quality (-0.175) than alcohol is (0.476), even though alcohol is the main effector of density and alcohol and density are themselves strongly correlated (-0.496). I suspected that something that increases density also positively impacts quality. To create the new variable, I took the density of pure alcohol, 0.789 kg/L, together with the alcohol content of each wine, and worked out what the density of each wine would be without the pure alcohol fraction. I used the following equation, solved for aqueous.density: (1 - alcohol/100) * aqueous.density + (alcohol/100) * 0.789 = density
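Here is a minimal sketch of that calculation, rearranged to isolate aqueous density. The function name and the example inputs are mine, for illustration only. It also assumes the alcohol and water fractions simply add by volume, which is only an approximation for real ethanol-water mixtures.

```r
# Density of pure ethanol in kg/L, as used in the text
ETHANOL_DENSITY <- 0.789

# Solve (1 - a) * aqueous + a * 0.789 = density for the aqueous density,
# where a = alcohol / 100 is the alcohol fraction
aqueous_density <- function(density, alcohol_pct) {
  a <- alcohol_pct / 100
  (density - a * ETHANOL_DENSITY) / (1 - a)
}

# Illustrative values only (hypothetical wine, not a row from the dataset)
example <- aqueous_density(density = 0.9968, alcohol_pct = 10)
print(example)
```

Because the function is vectorized, it could be applied to whole columns at once, e.g. `aqueous_density(wine$density, wine$alcohol)`, assuming the data frame is named `wine` as in the model output further down.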
The correlation between aqueous density and quality is 0.377, making it more than twice as strong a predictor of quality as the original density variable is at -0.175. Aqueous density edges out sulphates and citric acid to be the third strongest predictor of quality, nearly tied with the second-place predictor, volatile acidity. However, being a derivative feature, and knowing that aqueous density is mainly a function of the acids, chlorides and sulfurs in the wine, it will probably not be helpful once the other factors are added into a predictive model.
I also factorized the discrete variable quality so that I could easily visualize quality scores as individual boxes in ggplot boxplots. Not entirely necessary, but since the dataset had no categorical variables of its own, it seemed a nice pedagogical exercise.
All distributions looked roughly normal (or slightly poisson), with nothing particularly interesting: no bimodal distributions or anything like that. There was no need to tidy or adjust the data since it was clean and without missing values. The closest thing to a “transformation” I did was the categorization of quality into factor levels, which is trivial.
Pairwise plotting of the variables to visualize the relationships:
Hard to make much sense of it at this resolution, but it’s quite informative when blown up.
Note in this 2nd correlation plot I have rearranged the order of variables using a hierarchical clustering algorithm to show which groups of features are most strongly correlated to one another. I’ll discuss this matrix below.
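For readers who want to reproduce the reordering idea, here is a base-R sketch on synthetic data. The plotting package and clustering settings I actually used are not shown here, so treat the distance measure (`1 - |correlation|`) and the default `hclust()` linkage as assumptions.

```r
set.seed(1)
n <- 200
a <- rnorm(n)
b <- rnorm(n)

# Synthetic data (NOT the wine data): two pairs of strongly related columns,
# deliberately interleaved so the original column order hides the grouping
toy <- data.frame(
  a1 = a,
  b1 = b,
  a2 = a + rnorm(n, sd = 0.3),
  b2 = b + rnorm(n, sd = 0.3)
)

cm <- cor(toy)

# Cluster variables using 1 - |correlation| as a distance, then reorder
hc <- hclust(as.dist(1 - abs(cm)))
cm_ordered <- cm[hc$order, hc$order]
print(round(cm_ordered, 2))
```

After reordering, the related columns sit next to each other, which is what makes blocks of correlated features visible in the plot.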
Bivariate scatterplots are a logical way to compare the relationships of features to quality:
Alcohol is strongly correlated with density at -0.496, as expected since alcohol is less dense than water. What is interesting, however, is that density is less strongly correlated with quality than alcohol is. This suggests that some other factor raises density while also raising the score, partly masking alcohol’s effect (perhaps residual sugar, which would make wine more dense?).
The strongest effects on quality are high alcohol content (corr = 0.476), and low volatile acidity (corr = -0.391) (thus these pairwise scatterplots have the steepest fit line slopes). Citric acid and sulphates appear to play minor positive roles in quality, and the other variables appear to be uninfluential.
pH and fixed acidity are strongly correlated (-0.683); a nice sanity check: high acidity is the definition of low pH, after all.
pH and density appear to be mirror images of one another (strange), with a correlation of -0.342. Is this chance, or worth investigating further?
Fixed acidity and citric acid are highly correlated at 0.672, which chemically would be expected since citric acid lowers pH. I did some research and apparently citric acid is the 4th most common fixed acid in wine, after malic, tartaric and succinic, so citric acid is a subset of fixed acid.
Citric acid and volatile acidity are strongly negatively correlated. This is unexpected from a naive standpoint. Reading about the dataset, I see that volatile acidity means primarily acetic acid, or vinegar. Perhaps a wine with high citric acid must be low in acetic acid or the wine will present as too sour overall, so a lot of one necessitates less of the other. This would explain the negative correlation. Just a shot in the dark.
Total sulfur dioxide and free sulfur dioxide are highly correlated at 0.668, which is basically obvious and another good sanity check on the data (free sulfur dioxide is a subset of total sulfur dioxide).
The strongest correlations were pretty obvious ones:
No major relationships emerged that weren’t basically expected, except this one:
I wouldn’t have thought that nearly a quarter of the variation in quality could be explained by alcohol % alone (a correlation of 0.476 corresponds to an r^2 of about 0.23). I wonder, though, whether alcohol % is really a causative agent, or whether a longer fermentation process leads to both higher alcohol content and more flavor production and complexity. I suspect the latter. Otherwise, we’d all be drinking red vodka.
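As a rough check on how much variance a pairwise correlation accounts for, squaring the coefficients quoted in this report gives the share of variance in quality that each feature would explain in a one-variable linear fit. The vector below just restates those quoted values; nothing is recomputed from the data.

```r
# Pairwise correlations with quality, as quoted in the text
r <- c(alcohol = 0.476, volatile.acidity = -0.391,
       aqueous.density = 0.377, density = -0.175)

# Share of variance in quality explained by a one-variable linear fit
r_squared <- round(r^2, 3)
print(r_squared)
```

On this reading, alcohol alone accounts for a bit under a quarter of the variance in quality, and the original density variable for only about 3%.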
Here I’m taking a bit of a turn in my analysis. I realized that the red wine dataset on its own is pretty boring: entirely numeric variables, no time series, no discrete variables except for the quality scores. No categorical variables to play with. Yes, I could create categorical variables by binning continuous variables, but there’s no compelling reason I see why that wouldn’t be contrived.
The red wine dataset basically only lends itself to bivariate scatterplots. Multivariate plots of two features for X and Y with quality as color have proven boring and uninformative (I tried, not worth showing). A linear model to predict wine scores comes up with an unimpressive R^2 of 0.36, and the plots reflect this lack of strong trend and correlation.
To make this analysis more interesting and a better learning exercise, I’m adding in white wine data, merging datasets, and comparing reds to whites. The new “feature of interest” is no longer wine quality, but rather wine color: more specifically, trying to use plots to communicate the most significant differences between red and white wine. I’ll start by showing how I merged datasets and then dive into a little bivariate plotting of reds vs whites, then move into multivariate plots.
wineRed <- read.csv("wineQualityReds.csv")
wineWhite <- read.csv("wineQualityWhites.csv")
#create a "color" column to distinguish reds from whites after merging datasets
wineWhite$color <- "white"
wineWhite$X <- wineWhite$X + 1599 #unique row numbers > red numbers
wineRed$color <- "red"
#merge datasets
allWine <- rbind(wineRed, wineWhite)
#make sure row number is not misinterpreted as numeric
allWine$X <- as.factor(allWine$X)
WinePalette <- c('violetred4', 'gold1') #color reds red, whites yellowish
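The same merge pattern can be sanity-checked on toy stand-in frames. The values below are hypothetical and the real CSVs are not read. This variant offsets the white row numbers by `nrow()` of the red frame instead of a literal 1599, which is one way to avoid hard-coding the red row count.

```r
# Toy stand-ins for the two CSVs (hypothetical values)
toyRed <- data.frame(X = 1:3, alcohol = c(9.4, 9.8, 10.5))
toyWhite <- data.frame(X = 1:4, alcohol = c(8.8, 9.5, 10.1, 11.0))

toyRed$color <- "red"
toyWhite$color <- "white"
# Offset white row numbers by the red row count instead of a literal 1599
toyWhite$X <- toyWhite$X + nrow(toyRed)

# rbind() matches data frame columns by name, so both frames need the same set
toyAll <- rbind(toyRed, toyWhite)
print(table(toyAll$color))
```

Checking the per-color counts and the uniqueness of `X` after the merge is a cheap way to confirm nothing was dropped or duplicated.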
Now that we have a merged dataset, frequency plots to compare red vs white across all variables:
The main differences I notice are a lot more residual sugar and sulfur in white wine. Red wine, on the other hand, has slightly more sulphates, much more chloride, and higher fixed and volatile acidity. Red wine is more dense despite white wine’s sugar content. The two have comparable alcohol contents and a similar distribution of scores, which might indicate the tasters have done a good job of “normalizing” their palates across varietals.
Despite clear differences in most of these distributions, red and white wines overlap significantly on every single variable. In other words, there is no single variable that one could use as a means of confidently classifying wine as red or white.
Let’s see how different white wine’s correlation matrix looks:
Primarily, I see a strong correlation between residual sugar and density that was not present in red wines. White wine has much more residual sugar, and this increases density.
Boxplots to compare red vs white wine characteristics:
I think these boxplots complement the histograms nicely. We can easily see that volatile acidity, chlorides, sulfurs, and sulphates have non-overlapping interquartile ranges. They would be good candidates for red vs white classifiers. Let’s explore a few pairwise scatterplots of those features:
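The "non-overlapping interquartile ranges" criterion can also be checked numerically rather than by eye. Below is a small sketch with a hypothetical helper, `iqr_overlaps()`, run on toy vectors rather than the wine data.

```r
# Return TRUE when the interquartile ranges of two numeric vectors overlap
iqr_overlaps <- function(x, y) {
  qx <- quantile(x, c(0.25, 0.75), names = FALSE)
  qy <- quantile(y, c(0.25, 0.75), names = FALSE)
  qx[1] <= qy[2] && qy[1] <= qx[2]
}

# Toy vectors (hypothetical, not taken from the wine data)
low_group <- 1:9      # IQR is 3 to 7
high_group <- 11:19   # IQR is 13 to 17
mid_group <- 4:12     # IQR is 6 to 10

print(iqr_overlaps(low_group, high_group))
print(iqr_overlaps(low_group, mid_group))
```

Applied to the merged data, the two arguments would be one feature split by color, for example the chlorides of reds versus the chlorides of whites.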
We see in these plots that given sulfur and sugar levels, we could do a pretty good job of guessing if a wine is red or white, though some of the red wine points are buried under white wine points, showing imperfect separation. What about using chlorides and volatile acidity?
Volatile acidity is not the best separator, too much overlap in the range 0.2 to 0.6. Let’s try total sulfur dioxide vs chlorides:
Now that’s a nice separation. We can see exactly the line along which a logistic regression algorithm would pretty successfully separate reds and whites based on these two features alone.
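To make the "line a logistic regression would draw" concrete, here is a sketch on synthetic, standardized stand-ins for the two features. No classifier was fit in this report, so the data, the column names, and the resulting numbers are illustrative assumptions only.

```r
set.seed(42)
n <- 200

# Synthetic, standardized stand-ins for the two features (NOT the wine data):
# "reds" get lower sulfur and higher chlorides than "whites"
toy <- data.frame(
  is_red    = rep(c(1, 0), each = n),
  sulfur    = c(rnorm(n, mean = -1), rnorm(n, mean = 1)),
  chlorides = c(rnorm(n, mean = 0.75), rnorm(n, mean = -0.75))
)

fit <- glm(is_red ~ sulfur + chlorides, data = toy, family = binomial)

# Decision boundary: points where the fitted log-odds equal zero,
# b0 + b1 * sulfur + b2 * chlorides = 0, solved for chlorides
b <- coef(fit)
boundary_intercept <- -b[["(Intercept)"]] / b[["chlorides"]]
boundary_slope <- -b[["sulfur"]] / b[["chlorides"]]

# Training accuracy at the usual 0.5 probability cutoff
predicted_red <- as.numeric(predict(fit, type = "response") > 0.5)
accuracy <- mean(predicted_red == toy$is_red)
print(accuracy)
```

The intercept and slope describe a straight line in the sulfur-chlorides plane; points on one side are classified red and points on the other white.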
Do we notice any trend in quality as a function of sulfur and chloride levels?
Not really. What about alcohol vs vinegar levels?
Yes! This is actually a pretty good separation of high-scoring and low-scoring wines. Good wines have high alcohol / low vinegar, and vice versa for bad wines.
I’m worried the middle-tiered wines are overplotting our few very nice wines, so I want to replot only the wines scoring 3, 4, 8, or 9 and see how it looks:
Not a bad separation. Drinking alcohol is preferable to drinking vinegar. Slightly higher volatile acidity is admissible if alcohol levels are correspondingly high.
Is there a difference here between red and white wines?
The distribution for red and white wines seems pretty similar.
This was much more interesting than looking at one wine type alone. Red and white wine can be effectively separated in a 2-D scatterplot using a range of different feature pairs, particularly those features with the least overlapping ranges in the boxplots. 3-D scatterplots would be fun to play with as well. The last plot shows alcohol and volatile acidity strengthening one another, as a nice separation between high-scoring and low-scoring wines can be seen in these two dimensions. Building a classifier using a support vector machine or similar would be a natural next step in this analysis, but is a little out of scope for this report.
My hypothesis going in was that residual sugar was responsible for the increase in aqueous density that correlated with increased quality. I was wrong. In fact, in red wines aqueous density correlates much more strongly with fixed acidity, to the tune of 0.492. When I fit a linear model, adding fixed acidity makes the aqueous density feature irrelevant, so it appears aqueous density is mostly just a stand-in for fixed acidity.
In white wines I did not create an aqueous density feature; however, based on my sugar hypothesis, we’d expect white wines to be more dense than reds. They have much more residual sugar and similar alcohol levels to reds. Yet white wines are less dense despite the added sugar. Higher acid levels in red wines, particularly fixed acids, are probably the reason for their higher density.
I fit two linear models: one to predict quality of red wines, and the other to distinguish between red and white wines. Both made use of R’s built-in step analysis.
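For reference, a minimal sketch of AIC-based stepwise selection with `step()` on synthetic data. The direction and starting model used for the wine models are not recorded in this report, so backward elimination from the full model is an assumption made for illustration.

```r
set.seed(7)
n <- 300

# Synthetic data (NOT the wine data): y depends on x1 and x2 only
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n)

# Start from the full model and let step() drop terms by AIC
full_model <- lm(y ~ ., data = toy)
selected <- step(full_model, direction = "backward", trace = 0)

print(formula(selected))
print(summary(selected)$r.squared)
```

Note that AIC-based selection can occasionally retain a pure-noise regressor, so the selected formula is a starting point rather than proof that every retained term matters.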
The red wine quality model has an R^2 of 0.3595, which is pretty bad. The features that ended up being significant were chlorides, sulphates, alcohol, volatile acidity, pH, and free and total sulfur dioxide. Some of these relationships could not have been anticipated based on the pairwise correlations. I especially did not expect pH or free sulfur dioxide to be included, as both had correlations with quality of only about 0.05 in magnitude. This is a good lesson in not choosing your model based on pairwise correlations alone.
##
## Call:
## lm(formula = quality ~ chlorides + sulphates + alcohol + volatile.acidity +
##     pH + free.sulfur.dioxide + total.sulfur.dioxide, data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.68918 -0.36757 -0.04653  0.46081  2.02954
##
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)           4.4300987  0.4029168  10.995  < 2e-16 ***
## chlorides            -2.0178138  0.3975417  -5.076 4.31e-07 ***
## sulphates             0.8826651  0.1099084   8.031 1.86e-15 ***
## alcohol               0.2893028  0.0167958  17.225  < 2e-16 ***
## volatile.acidity     -1.0127527  0.1008429 -10.043  < 2e-16 ***
## pH                   -0.4826614  0.1175581  -4.106 4.23e-05 ***
## free.sulfur.dioxide   0.0050774  0.0021255   2.389    0.017 *
## total.sulfur.dioxide -0.0034822  0.0006868  -5.070 4.43e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6477 on 1591 degrees of freedom
## Multiple R-squared:  0.3595, Adjusted R-squared:  0.3567
## F-statistic: 127.6 on 7 and 1591 DF,  p-value: < 2.2e-16
Model 2 used color as the outcome, essentially using linear regression as a classifier. We get an R^2 of 0.825 using 5 features, which means we can do a pretty good job of classifying red vs white wine based on just a few regressors. Chlorides and sulfur alone produce a model with an R^2 of 0.60. I bet, however, that outliers throw off the model and that logistic regression or an SVM algorithm would do an even better job distinguishing red from white.
##
## Call:
## lm(formula = colorAsInt ~ sulphates + density + volatile.acidity +
##     residual.sugar + total.sulfur.dioxide, data = allWine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.82507 -0.10907 -0.00472  0.10100  1.52009
##
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)          -7.042e+01  1.103e+00  -63.82   <2e-16 ***
## sulphates             3.618e-01  1.712e-02   21.13   <2e-16 ***
## density               7.115e+01  1.117e+00   63.72   <2e-16 ***
## volatile.acidity      6.668e-01  1.621e-02   41.13   <2e-16 ***
## residual.sugar       -3.165e-02  7.453e-04  -42.46   <2e-16 ***
## total.sulfur.dioxide -3.072e-03  5.024e-05  -61.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1803 on 6491 degrees of freedom
## Multiple R-squared:  0.825, Adjusted R-squared:  0.8249
## F-statistic: 6122 on 5 and 6491 DF,  p-value: < 2.2e-16
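To show what "linear regression as a classifier" amounts to, here is a sketch on synthetic data. The coding of 1 = red and 0 = white and the 0.5 cutoff are assumptions for illustration; the encoding behind `colorAsInt` is not shown in this report.

```r
set.seed(3)
n <- 200

# Synthetic stand-in data (NOT the wine data): 1 = red, 0 = white
toy <- data.frame(
  colorAsInt = rep(c(1, 0), each = n),
  feature    = c(rnorm(n, mean = 1), rnorm(n, mean = -1))
)

# Ordinary least squares on a 0/1 outcome (a linear probability model)
fit <- lm(colorAsInt ~ feature, data = toy)

# Fitted values are not confined to [0, 1]; classify by cutting at 0.5
fitted_vals <- fitted(fit)
predicted <- as.numeric(fitted_vals > 0.5)
accuracy <- mean(predicted == toy$colorAsInt)

print(range(fitted_vals))
print(accuracy)
```

The fitted values stray outside the 0 to 1 range, which is one reason logistic regression is usually the more natural choice for a two-class outcome. The residuals in the model output above, which exceed 1 in magnitude, suggest the same thing happens on the real data.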
I had already made a simple scatterplot of alcohol vs quality in red wine, but I wanted to try different plot styles. I found I much preferred this plot because it adds a lot of information (quartiles and outliers) and yet still presents as cleaner than a basic scatterplot. It makes it much easier to see the marginal increase in alcohol for each one-point rise in quality, and to notice that this marginal increase only exists for wines with a score of 5 or above. And yet, the plot still preserves the detailed information held in the individual points, overlaid with high transparency so as not to impose too much on the boxes.
This plot demonstrates how influential high alcohol content and low volatile acidity are to the perceived quality of wine: higher alcohol and lower volatile acidity clearly correlate with an increase in wine quality. Separating white and red wine and stacking them vertically while maintaining consistent coordinates, it’s easy to communicate that the trend holds across wine category. Removing average wines, those scored at 5 or 6, prevents overplotting and helps highlight the difference between great wines and terrible wines.
This plot demonstrates the differences in sulfur and chloride levels between red and white wine. White wine has relatively more sulfur; red has more chloride. Visually we can see that one could do a pretty good job of predicting a wine’s color knowing only these two variables.
I would say that I had a lot more fun once I pooled red and white datasets and started to do a comparative analysis. Having a categorical variable to play with opened up the field of possibilities in terms of plot types and use of color. I played around with multivariate plots using only the red wine data, but failed to reveal any really compelling, non-obvious visual trends.
The biggest challenge I encountered in this project was figuring out how to customize charts using ggplot2. All in all, I have come to love ggplot, particularly how easy it is to produce a basic, attractive plot with a single line of code. But the flip side of the ease of automation is the difficulty of taking over manual control. To give just one example, I found it difficult to control the display of outlier points in my box plots.
As a followup to this exploration, I’d like to do a logistic regression classifier and see how well I can separate red and white wines. It looks from the plots like it shouldn’t be too hard to build a good classifier.
I found myself very much wanting time series data of some sort so that I could play around with line plots. I wonder if there’s any data out there on wine that includes age. The vintage of the wines tested would have been a nice additional feature to explore.